Project - Applied Statistics

by HARI SAMYNAATH S

Part ONE

Context:
Medical research university X is conducting in-depth research on patients with certain conditions. The university has an internal AI team. For confidentiality, the patients' details and the conditions are masked by the client, who provides separate datasets to the AI team for developing an AI/ML model that can predict a patient's condition from the received test results.

Data Description:
The data consists of biomechanics features of the patients according to their current conditions. Each patient is represented in the dataset by six biomechanics attributes derived from the shape and orientation of the condition relative to the affected body part.

  1. P_incidence
  2. P_tilt
  3. L_angle
  4. S_slope
  5. P_radius
  6. S_degree
  7. Class

Project Objective:
Demonstrate the ability to fetch, process and leverage data to generate useful predictions by training Supervised Learning algorithms.

● Steps and Tasks:
1. Data Understanding:
a. Read all the 3 CSV files as DataFrame and store them into 3 separate variables.
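A minimal sketch of steps 1a-1b with pandas; the filenames ("Type_H.csv" etc.) are assumptions, and tiny stand-in files are written first so the sketch runs end to end:

```python
import pandas as pd

# Write tiny stand-in CSVs (the real files and their names are assumptions).
cols = ["P_incidence", "P_tilt", "L_angle", "S_slope", "P_radius", "S_Degree", "Class"]
for name in ["Type_H.csv", "Type_S.csv", "Normal.csv"]:
    pd.DataFrame([[63.0, 22.5, 39.6, 40.5, 98.7, -0.25, name.split(".")[0]]],
                 columns=cols).to_csv(name, index=False)

# 1a: read each CSV into its own DataFrame
df_h = pd.read_csv("Type_H.csv")
df_s = pd.read_csv("Type_S.csv")
df_n = pd.read_csv("Normal.csv")

# 1b: print shape and columns of each DataFrame
for name, df in [("Type_H", df_h), ("Type_S", df_s), ("Normal", df_n)]:
    print(name, df.shape, list(df.columns))
```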

1. Data Understanding:
b. Print Shape and columns of all the 3 DataFrames.

All three datasets have the same set of attributes, but the number of records varies, with Type_H having the lowest count.

1. Data Understanding:
c. Compare Column names of all the 3 DataFrames and clearly write observations.

P_incidence, P_tilt, L_angle, S_slope, P_radius, S_Degree, Class

All three DataFrames have the same set of columns.

1. Data Understanding:
d. Print DataTypes of all the 3 DataFrames.

Every DataFrame consists of the following data types

  1. floating point predictor variables 'P_incidence', 'P_tilt', 'L_angle', 'S_slope', 'P_radius' and 'S_Degree'
  2. object type class information in 'Class' attribute

1. Data Understanding:
e. Observe and share variation in ‘Class’ feature of all the 3 DataFrames.

Although all datapoints in a given DataFrame correspond to a particular class of patients, the values in the 'Class' attribute show spelling/case variations for the same class name, viz.,
Normal <==> Nrmal
Type_S <==> tp_s
Type_H <==> type_h

These variants would be misinterpreted as different classes by any classification algorithm, and hence need to be corrected.

2. Data Preparation and Exploration:
a. Unify all the variations in ‘Class’ feature for all the 3 DataFrames.

Based on the attributes and cross-referencing public datasets, these data belong to a study of orthopaedic patients built by Dr. Henrique da Mota.

Accordingly, let us correct the class labels as follows:
Normal, Nrmal ==> Normal
Type_S, tp_s ==> Spondylolisthesis
Type_H, type_h ==> Disk Hernia
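The mapping above can be sketched with pandas `replace`; the small frame here is illustrative, and in the real study the same map is applied to each of the three DataFrames:

```python
import pandas as pd

# Map every spelling/case variant onto one canonical label
# (variant strings taken from the observation above).
class_map = {"Nrmal": "Normal",
             "Type_S": "Spondylolisthesis", "tp_s": "Spondylolisthesis",
             "Type_H": "Disk Hernia", "type_h": "Disk Hernia"}

# Illustrative frame containing all observed variants
df = pd.DataFrame({"Class": ["Normal", "Nrmal", "tp_s", "Type_S", "type_h", "Type_H"]})
df["Class"] = df["Class"].replace(class_map)
print(df["Class"].unique())
```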

2. Data Preparation and Exploration:
b. Combine all the 3 DataFrames to form a single DataFrame

2. Data Preparation and Exploration:
c. Print 5 random samples of this DataFrame

2. Data Preparation and Exploration:
d. Print Feature-wise percentage of Null values.

Each column contains only valid data, as also confirmed by ortho.info(), which reports 310 non-null entries for each column.
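Steps 2b-2d can be sketched as follows; the three small frames stand in for the class-wise DataFrames, and the combined frame is named `ortho` to match the note above:

```python
import pandas as pd

# Illustrative frames standing in for the three class-wise DataFrames.
df_h = pd.DataFrame({"P_incidence": [63.0, 48.9], "Class": "Disk Hernia"})
df_s = pd.DataFrame({"P_incidence": [74.4, 89.7], "Class": "Spondylolisthesis"})
df_n = pd.DataFrame({"P_incidence": [38.5, 54.9], "Class": "Normal"})

# 2b: stack the three frames into one, resetting the index
ortho = pd.concat([df_h, df_s, df_n], ignore_index=True)

# 2c: random samples (here only 6 rows, so sample what we have)
print(ortho.sample(min(5, len(ortho)), random_state=42))

# 2d: feature-wise percentage of null values
null_pct = ortho.isnull().mean() * 100
print(null_pct)
```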

2. Data Preparation and Exploration:
e. Check 5-point summary of the new DataFrame.

The features are centred around a wide range of medians (roughly 11 to 120), and the attributes are on different scales and ranges, so appropriate preprocessing is necessary before modelling.

3. Data Analysis:
a. Visualize a heatmap to understand correlation between all features

3. Data Analysis:
b. Share insights on correlation.
i. Features having stronger correlation with correlation value.
ii. Features having weaker correlation with correlation value.

As a dataset, we find low multicollinearity (the attributes are not heavily correlated),
which is better for regression models to explain the feature-wise influence on the result.

Let's mention the top 2 and bottom 2 pairs.
The feature pairs with the strongest correlation are:
S_slope : P_incidence ==> 0.81
L_angle : P_incidence ==> 0.72

There are no negative correlation pairs as strong as the above.

The feature pairs with the weakest correlation are:
P_tilt : P_radius ==> 0.033
S_Degree : P_radius ==> -0.026

Although other pairs also have small correlation values, these two are the closest to zero in magnitude, hence they qualify as the least correlated pairs.
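Ranking pairs by absolute correlation can be sketched directly from the correlation matrix; the data here is synthetic (the study's actual values come from the combined `ortho` frame):

```python
import numpy as np
import pandas as pd

# Synthetic data: S_slope is built to correlate strongly with P_incidence,
# while P_radius is independent of both.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"P_incidence": x,
                   "S_slope": x + 0.3 * rng.normal(size=200),
                   "P_radius": rng.normal(size=200)})

corr = df.corr()
# Keep only the upper triangle (each pair once), flatten to (pair, r),
# then sort by absolute correlation, strongest first.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```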

3. Data Analysis:
c. Visualize a pairplot with 3 classes distinguished by colors and share insights.

From the pairplot and KDE distribution plots, we can infer the following.
Pairs with considerable relationships are:

  1. P_incidence - P_tilt
  2. P_incidence - L_angle
  3. P_incidence - S_slope
  4. L_angle - S_slope

The colour grouping helps identify the following:
the Spondylolisthesis class is clearly separated from the rest,
while there is a large overlap between the Normal and Disk Hernia classes in the 2D relation plots;
hopefully the classes are linearly separable in higher dimensions.

3. Data Analysis:
d. Visualize a jointplot for ‘P_incidence’ and ‘S_slope’ and share insights.

As seen earlier in the correlation heatmap and pairplot visualisations,
the jointplot reiterates that P_incidence and S_slope exhibit a strong relationship with each other.
It can also be noted that both attributes are individually right-skewed and concentrated near their respective modes.

3. Data Analysis:
e. Visualize a boxplot to check distribution of the features and share insights.

Almost every feature has values centred around its median, with a few outliers,
except for S_Degree, which is heavily right-skewed, with one extreme outlier;
the same is witnessed in its coefficient of variation of 1.42.
Interestingly, P_radius has outliers on both sides.

4. Model Building:
a. Split data into X and Y.

4. Model Building:
b. Split data into train and test with 80:20 proportion.
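Steps 4a-4b can be sketched with scikit-learn; the frame here is synthetic, and the `stratify` argument (an assumption, as the brief only asks for an 80:20 split) keeps class proportions equal in both partitions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined ortho frame.
ortho = pd.DataFrame({"P_incidence": range(100),
                      "S_slope": range(100),
                      "Class": ["Normal", "Disk Hernia"] * 50})

# 4a: predictors vs. target
X = ortho.drop(columns=["Class"])
y = ortho["Class"]

# 4b: 80:20 split, stratified on the class label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)
```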

4. Model Building:
c. Train a Supervised Learning Classification base model using KNN classifier.

4. Model Building:
d. Print all the possible classification metrics for both train and test data.

Being a medical use case, recall scores are to be considered a critical evaluation parameter.
The recall values of 0.75 for Disk Hernia and 0.853 for Spondylolisthesis need to be improved.
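A sketch of steps 4c-4d, training a KNN base model and printing per-class metrics on both partitions; the data is synthetic, so the recall figures quoted above will not be reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class, 6-feature data standing in for the ortho features.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# 4c: KNN base model with default neighbourhood size
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# 4d: metrics for both partitions; per-class recall is the figure to watch
print("train:\n", classification_report(y_train, knn.predict(X_train)))
print("test:\n", classification_report(y_test, knn.predict(X_test)))
```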

5. Performance Improvement:
a. Tune the parameters/hyperparameters to improve the performance of the base model.

Accuracy : 0.823 --> 0.839
Spondylolisthesis recall : 0.853 --> 0.966
While both of the above have improved,
Disk Hernia recall has dropped: 0.750 --> 0.538.

Given that accuracy improves while recall drops, let's not decide based on this single-record deletion alone;
after studying the other options, let's attempt outlier deletion last, for the final decision.

Standardisation has improved accuracy, and as a standard measure for any analytics
we will continue using standardised data.

The Disk Hernia class has a very low proportion of records.

Clearly, the Normal and Disk Hernia classes have been upsampled.

The data balancing has helped improve the accuracy further, along with a recall improvement for Disk Hernia.
Let's try ADASYN balancing to see whether better results can be achieved.

ADASYN-based balancing caused a further loss of accuracy and recall scores,
hence we will stick with SMOTE going forward for this use case.

Hyperparameter tuning shows results similar to the untuned model (compared with the SMOTE-balanced model);
in this use case our initial hyperparameters were coincidentally the best.
Let's see if we can further improve the results.

Polynomial features have failed to improve results, so let's drop that approach.
As mentioned earlier, let's try outlier deletion with the tuned model.

Once again, outlier deletion has not helped much.

Based on the above studies,
we shall fix our choice as
the standardised, SMOTE-upsampled, hyperparameter-tuned model.
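The chosen recipe can be sketched as a scaling + grid-search pipeline; the data is synthetic, and the SMOTE upsampling step (done in the study via imblearn) is omitted here to keep the sketch dependency-free, so this only illustrates the standardisation and tuning stages:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced 3-class data standing in for the ortho features.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           n_features=6, weights=[0.5, 0.3, 0.2],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Standardise inside the pipeline so the scaler is fit on training folds only,
# then grid-search the KNN hyperparameters with macro recall as the criterion.
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe,
                    param_grid={"knn__n_neighbors": [3, 5, 7, 9, 11],
                                "knn__weights": ["uniform", "distance"]},
                    scoring="recall_macro", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 3))
```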

5. Performance Improvement:
b. Clearly showcase improvement in performance achieved.

5. Performance Improvement:
c. Clearly state which parameters contributed most to improve model performance.
What could be the probable reason?

As mentioned towards the end of the study,
standardisation and SMOTE upsampling helped improve accuracy and recall.
Had we chosen other initial hyperparameters, we might also have seen the significance of hyperparameter tuning.

Reasons: standardisation and upsampling remove the undue weight inherently carried by feature scales and class frequencies,
hence a noticeable improvement in results was found.

======================================================================================================


Part TWO

Context:
A bank X is on a massive digital transformation across all its departments. The bank has a growing customer base, the majority of whom are liability customers (depositors) rather than borrowers (asset customers). The bank is interested in expanding its borrower base rapidly to bring in more business via loan interest. A campaign the bank ran last quarter showed a single-digit average conversion rate. With digital transformation being the core of the business strategy, the marketing department wants to devise more effective campaigns with better-targeted marketing, to increase the conversion ratio to double digits on the same budget as the last campaign.

Data Description:
The data consists of the following attributes:

  1. ID: Customer ID
  2. Age: Customer’s approximate age.
  3. CustomerSince: Customer of the bank since. [unit is masked]
  4. HighestSpend: Customer’s highest spend so far in one transaction. [unit is masked]
  5. ZipCode: Customer’s zip code.
  6. HiddenScore: A score associated to the customer which is masked by the bank as an IP.
  7. MonthlyAverageSpend: Customer’s monthly average spend so far. [unit is masked]
  8. Level: A level associated to the customer which is masked by the bank as an IP.
  9. Mortgage: Customer’s mortgage. [unit is masked]
  10. Security: Customer’s security asset with the bank. [unit is masked]
  11. FixedDepositAccount: Customer’s fixed deposit account with the bank. [unit is masked]
  12. InternetBanking: if the customer uses internet banking.
  13. CreditCard: if the customer uses bank’s credit card.
  14. LoanOnCard: if the customer has a loan on credit card.

Project Objective:
Build a Machine Learning model to perform focused marketing by predicting the potential customers who will convert using the historical dataset.

● Steps and Tasks:
1. Data Understanding and Preparation:
a. Read both the Datasets ‘Data1’ and ‘Data2’ as DataFrame and store them into two separate variables.

1. Data Understanding and Preparation:
b. Print shape and Column Names and DataTypes of both the Dataframes.

Data1 and Data2 are to be considered together as a single dataset,
merged on the ID column, since they hold different attributes.

Features like ZipCode are stored as a numeric type and should be changed to object (categorical) type;
the others are to be studied further.
Apparently there is no output feature; let's study further.

1. Data Understanding and Preparation:
c. Merge both the Dataframes on ‘ID’ feature to form a single DataFrame
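A minimal sketch of step 1c; the frames are illustrative fragments of Data1/Data2, and the inner join type is an assumption (with identical ID sets the join type makes no difference):

```python
import pandas as pd

# Illustrative fragments: the two files share only the 'ID' key.
data1 = pd.DataFrame({"ID": [1, 2, 3], "Age": [34, 45, 29]})
data2 = pd.DataFrame({"ID": [1, 2, 3], "LoanOnCard": [0, 1, 0]})

# 1c: merge on the common key into a single DataFrame
bank = pd.merge(data1, data2, on="ID", how="inner")
print(bank.shape, list(bank.columns))
```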

1. Data Understanding and Preparation:
d. Change the Datatype of the below features to ‘Object’: ‘CreditCard’, ‘InternetBanking’, ‘FixedDepositAccount’, ‘Security’, ‘Level’, ‘HiddenScore’.
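Step 1d can be sketched with a single `astype` call over the listed columns; the frame here is a small illustrative stand-in for the merged data:

```python
import pandas as pd

# Illustrative stand-in: the masked categorical codes arrive as integers.
bank = pd.DataFrame({"CreditCard": [0, 1], "InternetBanking": [1, 1],
                     "FixedDepositAccount": [0, 0], "Security": [1, 0],
                     "Level": [2, 4], "HiddenScore": [1, 3]})

# 1d: cast the categorical codes from numeric to object dtype
cat_cols = ["CreditCard", "InternetBanking", "FixedDepositAccount",
            "Security", "Level", "HiddenScore"]
bank[cat_cols] = bank[cat_cols].astype("object")
print(bank.dtypes)
```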

2. Data Exploration and Analysis:
a. Visualize distribution of Target variable ‘LoanOnCard’ and clearly share insights.

Less than 11% of customers have a loan on their credit card.
While the target variable is heavily imbalanced, there is still scope to identify potential borrowers.

2. Data Exploration and Analysis:
b. Check the percentage of missing values and impute if required.

Since 'LoanOnCard' is the target variable and only 0.4% of the dataset has missing values,
let us drop those records.

2. Data Exploration and Analysis:
c. Check for unexpected values in each categorical variable and impute with best suitable value.

There are no unexpected values in the categorical (object type) columns;
all values are numeric, from 0 to 4 only.
Note: since some features have more than 2 categories, ensure dummies are used on the LEVEL and HIDDENSCORE fields.

Let's also check the numerical columns for unexpected values.

Surprisingly, there are negative values in the CustomerSince field, which is counter-intuitive;
let us examine them.

Given that all the unexpected records have no LoanOnCard,
that a relationship with the bank cannot be of negative duration,
and that there is already a large imbalance in the data, let us drop these records.
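Both clean-ups, dropping rows with a missing target and dropping the impossible negative CustomerSince values, can be sketched as boolean filtering; the frame here is illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative frame: one missing target, one negative CustomerSince.
bank = pd.DataFrame({"CustomerSince": [12, -3, 7, 20],
                     "LoanOnCard": [0.0, 0.0, np.nan, 1.0]})

bank = bank.dropna(subset=["LoanOnCard"])   # drop rows with a missing target
bank = bank[bank["CustomerSince"] >= 0]     # drop the unexpected negatives
print(bank.shape)
```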

3. Data Preparation and model building:
a. Split data into X and Y.

Since the features are of different scales, it is advisable to standardise the data.

3. Data Preparation and model building:
b. Split data into train and test. Keep 25% data reserved for testing.

3. Data Preparation and model building:
d. Print evaluation metrics for the model and clearly share insights.

Though the accuracy is 95.8%, classification of class 1 is poor;
class 1 being heavily under-represented could be the reason.

3. Data Preparation and model building:
e. Balance the data using the right balancing technique.
f. Again train the same previous model on balanced data.
g. Print evaluation metrics and clearly share differences observed.

While accuracy has reduced from 95.8% to 88.9%,
recall scores have increased, making the model less biased toward the majority class.

4. Performance Improvement:
a. Train a base model each for SVM, KNN.

4. Performance Improvement:
b. Tune parameters/hyperparameters for each of the models wherever required and finalize a model.
c. Print evaluation metrics for final model.
d. Share improvement achieved from base model to final model.

Accuracy improved after tuning, from 88.9% with LogisticRegression to 93.6% with both KNN and SVM.
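Tuning both candidates and comparing them can be sketched as below; the data is synthetic and imbalanced like the bank target, so the accuracy figures quoted above will not be reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced binary data standing in for the bank features.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7)

# 4a-4b: a scaled pipeline and a small grid for each candidate model
candidates = {
    "svm": (Pipeline([("s", StandardScaler()), ("m", SVC())]),
            {"m__C": [0.1, 1, 10], "m__gamma": ["scale", "auto"]}),
    "knn": (Pipeline([("s", StandardScaler()), ("m", KNeighborsClassifier())]),
            {"m__n_neighbors": [3, 5, 7, 9]}),
}
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, cv=5).fit(X_train, y_train)
    print(name, search.best_params_, round(search.score(X_test, y_test), 3))
```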